Abstract —Various studies that address the compressed 
sensing problem with Multiple Measurement Vectors 
(MMVs) have recently been carried out. These studies assume 
the vectors of the different channels to be jointly sparse. 
In this paper, we relax this condition. Instead we assume 
that these sparse vectors depend on each other but that 
this dependency is unknown. We capture this dependency 
by computing the conditional probability of each entry 
in each vector being non-zero, given the “residuals” of 
all previous vectors. To estimate these probabilities, we 
propose the use of the Long Short-Term Memory (LSTM) 
11 1, a data driven model for sequence modelling that is deep 
in time. To calculate the model parameters, we minimize 
a cross entropy cost function. To reconstruct the sparse 
vectors at the decoder, we propose a greedy solver that uses 
the above model to estimate the conditional probabilities. 
By performing extensive experiments on two real world 
datasets, we show that the proposed method significantly 
outperforms the general MMV solver (the Simultaneous 
Orthogonal Matching Pursuit (SOMP)) and a number of 
the model-based Bayesian methods. The proposed method 
does not add any complexity to the general compressive 
sensing encoder. The trained model is used just at the 
decoder. As the proposed method is a data driven method, it 
is only applicable when training data is available. In many 
applications however, training data is indeed available, e.g. 
in recorded images and videos. 

Index Terms —Compressive Sensing, Deep Learning, 
Long Short-Term Memory. 


I. Introduction 

Compressive Sensing (CS) 111 , 0,111 is an effective 
approach for acquiring sparse signals where 
both sensing and compression are performed at the same 
time. Since there are numerous examples of natural 
and artificial signals that are sparse in the time, spatial 
or a transform domain, CS has found numerous applications. 
These include medical imaging, geophysical 
data analysis, computational biology, remote sensing and 
communications. 
In the general CS framework, instead of acquiring N 
samples of a signal $\mathbf{x} \in \Re^{N \times 1}$, M random measurements 
are acquired where M < N. This is expressed by 
the underdetermined system of linear equations: 

$$\mathbf{y} = \mathbf{\Phi}\mathbf{x} \tag{1}$$

where $\mathbf{y} \in \Re^{M \times 1}$ is the known measured vector and 
$\mathbf{\Phi} \in \Re^{M \times N}$ is a random measurement matrix. To uniquely 
recover $\mathbf{x}$ given $\mathbf{y}$ and $\mathbf{\Phi}$, $\mathbf{x}$ must be sparse in a given 
basis $\mathbf{\Psi}$. This means that 

$$\mathbf{x} = \mathbf{\Psi}\mathbf{s} \tag{2}$$

where $\mathbf{s}$ is K-sparse, i.e., $\mathbf{s}$ has at most K non-zero elements. 
The basis $\mathbf{\Psi}$ can be complete, i.e., $\mathbf{\Psi} \in \Re^{N \times N}$, 
or over-complete, i.e., $\mathbf{\Psi} \in \Re^{N \times N_1}$ where $N < N_1$ 
(compressed sensing for over-complete dictionaries is 
introduced in 0). From (1) and (2): 

$$\mathbf{y} = \mathbf{A}\mathbf{s} \tag{3}$$

where $\mathbf{A} = \mathbf{\Phi}\mathbf{\Psi}$. Since there is only one measurement 
vector, the above problem is usually called the Single 
Measurement Vector (SMV) problem in compressive 
sensing. 

In distributed compressive sensing, also known as the 
Multiple Measurement Vectors (MMV) problem, a set of 
L sparse vectors $\{\mathbf{s}_i\}_{i=1,2,\dots,L}$ is to be jointly recovered 
from a set of L measurement vectors $\{\mathbf{y}_i\}_{i=1,2,\dots,L}$. 
Some application areas of MMV include magnetoencephalography, 
array processing, equalization of sparse 
communication channels and cognitive radio 0. 

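Suppose that the L sparse vectors and the L measurement 
vectors are arranged as columns of matrices 
$\mathbf{S} = [\mathbf{s}_1, \mathbf{s}_2, \dots, \mathbf{s}_L]$ and $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_L]$ respectively. 
In the MMV problem, S is to be reconstructed 
given Y: 

$$\mathbf{Y} = \mathbf{A}\mathbf{S} \tag{4}$$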


In 0, S is assumed to be jointly sparse, i.e., non-zero 
entries of each vector occur at the same locations 
as those of other vectors, which means that the sparse 
vectors have the same support. Assume that S is jointly 
sparse. Then, the necessary and sufficient condition to 
obtain a unique S given Y is 171: 

$$|supp(\mathbf{S})| < \frac{spark(\mathbf{A}) - 1 + rank(\mathbf{S})}{2} \tag{5}$$
where $|supp(\mathbf{S})|$ is the number of rows in S with non-zero 
energy and the spark of a given matrix is the smallest 
possible number of linearly dependent columns of that 
matrix. The spark gives a measure of linear dependency 
in the system modelled by a given matrix. In the 
SMV problem, no rank information exists. In the MMV 
problem, the rank information exists and affects the 
uniqueness bounds. Generally, solving the MMV problem 
jointly can lead to better uniqueness guarantees than 
solving the SMV problem for each vector independently 

ij. 
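To make the role of the spark and the rank in condition (5) concrete, the following is a minimal sketch that computes the spark of a small matrix by brute force and evaluates the bound. It assumes the bound has the usual form (spark(A) - 1 + rank(S))/2; the function names `spark` and `max_unique_support` are hypothetical, and the exhaustive search is only feasible for toy-sized matrices.

```python
import math
from itertools import combinations

import numpy as np


def spark(A, tol=1e-10):
    """Smallest number of linearly dependent columns of A (brute force)."""
    n = A.shape[1]
    for size in range(1, n + 1):
        for cols in combinations(range(n), size):
            # A column subset is dependent when its rank is below its size.
            if np.linalg.matrix_rank(A[:, list(cols)], tol=tol) < size:
                return size
    # Assumed convention: all columns independent -> spark = n + 1.
    return n + 1


def max_unique_support(A, rank_S):
    """Largest integer |supp(S)| strictly below (spark(A) - 1 + rank(S)) / 2."""
    bound = (spark(A) - 1 + rank_S) / 2.0
    return math.ceil(bound) - 1


A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
print(spark(A))                  # any two columns are independent, all three are not
print(max_unique_support(A, 1))  # SMV-like case, rank(S) = 1
print(max_unique_support(A, 2))  # a higher rank(S) relaxes the bound
```

The last two calls illustrate the point made in the text: for the same A, a larger rank of S permits a larger support while keeping uniqueness.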

In the current MMV literature, a jointly sparse matrix 
is recovered typically by one of the following methods; 
1) greedy methods El like Simultaneous Orthogonal 
Matching Pursuit (SOMP) which performs non-optimal 
subset selection, 2) relaxed mixed norm minimization 
methods ifTOl . or 3) Bayesian methods like ifTTI . ifT^ . 
ca where a posterior density function for the values of 
S is created, assuming a prior belief, e.g., Y is observed 
and S should be sparse in basis $\mathbf{\Psi}$. The selection of 
one of the above methods depends on the requirements 
imposed by the specific application. 

A. Problem Statement 

The MMV reconstruction methods stated above do 
not rely on the use of training data. However, for many 
applications, a large amount of data similar to the data to 
be compressed by CS is available. Examples are camera 
recordings of the same environment, images of the same 
class (e.g., flowers, buildings, ....), electroencephalogram 
(EEG) of different parts of the brain, etc. In this paper, 
we address the following questions in the MMV problem 
when training data is available: 

1) Can we learn the structure of the sparse vectors in 
S by a data driven bottom up approach using the 
already available training data? If yes, then how 
can we exploit this structure in the MMV problem 
to design a better reconstruction method? 

2) Most of the reconstruction algorithms for the 
MMV problem rely on the joint sparsity of S. 
However, in some practical applications, the sparse 
vectors in S are not exactly jointly sparse. This 
can be due to noise or due to sources that create 
different sparsity patterns. Examples are images 
of different scenes captured by different cameras, 
images of different classes, etc. Although S is 
not jointly sparse, there may exist a dependency 
among the columns of S. However, due 
to the lack of joint sparsity, the above methods will 
not give satisfactory performance. The question 
is, can we design the aforementioned data driven 
method in a way that it captures the dependencies 
among the sparse vectors in S? The type of such 
dependencies may not necessarily be that of joint 
sparsity. And then how can we use the learned dependency 
structure in the reconstruction algorithm 
at the decoder? 

Please note that we want to address the above questions 
“without adding any complexity or adaptability” 
to the encoder. In other words, our aim is not to design 
an optimal encoder, i.e., an optimal sensing matrix $\mathbf{\Phi}$ or 
sparsifying basis $\mathbf{\Psi}$, for the given training data. The 
encoder should be as simple and general as possible. 
This is especially important for applications that use 
sensors having low power consumption due to a limited 
battery life. However, the decoder in these cases can be 
much more complex than the encoder. For example, the 
decoder can be a powerful data processing machine. 

B. Proposed Method 

To address the above questions, we propose the use of 
a two step greedy reconstruction algorithm. In the first 
step, at each iteration of the reconstruction algorithm, 
and for each column of S represented as $\mathbf{s}_i$, we first find 
the conditional probability of each entry of $\mathbf{s}_i$ being non-zero, 
given the residuals of all previous sparse vectors 
(columns) at that iteration. Then we select the most 
probable entry and add it to the support of $\mathbf{s}_i$. The 
definition of the residual matrix at the j-th iteration is 
$\mathbf{R}_j = \mathbf{Y} - \mathbf{A}\mathbf{S}_j$ where $\mathbf{S}_j$ is the estimate of the sparse 
matrix S at the j-th iteration. Therefore in the first 
step, we find the locations of the non-zero entries. In the 
second step we find the values of these non-zero entries. 
This can be done by solving a least squares problem that 
finds $\mathbf{s}_i$ given $\mathbf{y}_i$ and $\mathbf{A}^{\Omega_i}$, where $\mathbf{A}^{\Omega_i}$ is a matrix that includes 
only those atoms (columns) of A that are members of 
the support of $\mathbf{s}_i$. 

To find the conditional probabilities at each iteration, 
we propose the use of a Recurrent Neural Network 
(RNN) with Long Short-Term Memory (LSTM) cells 
and a softmax layer on top of it. To find the model 
parameters, we minimize a cross entropy cost function 
between the conditional probabilities given by the model 
and the known probabilities in the training data. The details 
on how to generate the training data and the training 
data probabilities are explained in subsequent sections. 
Please note that this training is done only once. After 
that, the resulting model is used in the reconstruction 
algorithm for any test data that has not been observed by 
the model before. Therefore, the proposed reconstruction 
algorithm would be almost as fast as the greedy methods. 
The block diagram of the proposed method is presented 
in Fig. 1 and Fig. 2. We will explain these figures in 
detail in subsequent sections. 

Fig. 1. Block diagram of the proposed method unfolded over channels. 

Fig. 2. Block diagram of the Long Short-Term Memory (LSTM). 

To the best of our knowledge, this is the first model-based 
method in MMV sparse reconstruction that is 
based on a deep learning bottom up approach. Similar 
to all deep learning methods, it has the important feature 
of learning the structure of S from the raw data automatically. 
Although it is based on a greedy method that 
selects subsets that are not necessarily optimal, we experimentally 
show that by using a properly trained model 
and only one layer of LSTM, the proposed method significantly 
outperforms well known MMV baselines (e.g., 
SOMP) as well as the well known Bayesian methods 
for the MMV problem (e.g., Multitask Bayesian Compressive 
Sensing (MT-BCS) lfl^ and Sparse Bayesian 
Learning for temporally correlated sources (T-SBL) lfT3l ). 
We show this on two real world datasets. 

We emphasize that the computations carried out at the encoder 
mainly include multiplication by a random matrix. 
The extra computations are only needed at the decoder. 
Therefore an important feature of compressive sensing 
(low power encoding) is preserved. 

C. Related Work 

Exploiting data structures besides sparsity for compressive 
sensing has been extensively studied in the 

literature Cl, fH, QS], Ell, ED, CD, IHl, ED, 
ESTl . ED, EQI, IED. In IHl, it has been theoretically 
shown that using signal models that exploit these 
structures will result in a decrease in the number of 
measurements. In Q, a thorough review on CS methods 
that exploit the structure present in the sparse signal or 
in the measurements is presented. In E3, a Bayesian 
framework for CS is presented. This framework uses a 
prior information about the sparsity of s to provide a 
posterior density function for the entries of s (assuming 
y is observed). It then uses a Relevance Vector Machine 
(RVM) II 22 I to estimate the entries of the sparse vector. 
This method is called Bayesian Compressive Sensing 
(BCS). In ifT^ . a Bayesian framework is presented for 
the MMV problem. It assumes that the L “tasks” in the 
MMV problem in Q, are not statistically independent. 
By imposing a shared prior on the L tasks, an empirical 
method is presented to estimate the hyperparameters 
and extensions of RVM are used for the inference step. 
This method is known as Multitask Compressive Sensing 
(MT-BCS). In ifl^ . it is experimentally shown that 
MT-BCS outperforms the method that applies Orthogonal 
Matching Pursuit (OMP) on each task, the Simultaneous 
Orthogonal Matching Pursuit (SOMP) method 
which is a straightforward extension of OMP for the 
MMV problem, and the method that applies BCS for 
each task. In El, the Sparse Bayesian Learning (SBL) 
1221, is used to solve the MMV problem. It was 
shown that the global minimum of the proposed method 
is always the sparsest one. The authors in ifTSl . address 
the MMV problem when the entries in each row of S are 
correlated. An algorithm based on SBL is proposed and 
it is shown that the proposed algorithm outperforms the 
mixed norm ($\ell_{1,2}$) optimization as well as the method 
proposed in HD. The proposed method is called T-SBL. 
In m, a greedy algorithm aided by a neural 
network is proposed to address the SMV problem in 
0. The neural network parameters are calculated by 
solving a regression problem and are used to select the 
appropriate column of A at each iteration of OMP. The 
main modification to OMP is replacing the correlation 
step with a neural network. They experimentally show 
that the proposed method outperforms OMP and $\ell_1$ 
optimization. This method is called Neural Network 
OMP (NNOMP). In ifTTl . an extension of m with 
a hierarchical Deep Stacking Network (DSN) ll24l is 
proposed for the MMV problem. “The joint sparsity of S 
is an important assumption in the proposed method”. To 
train the DSN model, the Restricted Boltzmann Machine 
(RBM) ||25]| is used to pre-train DSN and then fine tuning 
is performed. It has been experimentally shown that 
this method outperforms SOMP and $\ell_{1,2}$ in the MMV 
problem. The proposed methods are called Nonlinear 
Weighted SOMP (NWSOMP) for the one layer model 
and DSN-WSOMP for the multilayer model. In ifTSl . a 
feedforward neural network is used to solve the SMV 
problem as a regression task. Similar to ifTTll (if we 
assume that we have only one sparse vector in ifTTll L a 
pre-training phase followed by a fine tuning is used. For 
pre-training, the authors have used Stacked Denoising 
Auto-encoder (SDA) ll2^ . Please note that an RBM with 
Gaussian visible units and binary hidden units (i.e., the 
one used in El) has the same energy function as an 
auto-encoder with sigmoid hidden units and real valued 
observations IIZTl . Therefore the extension of ifTSl to the 
MMV problem will give similar performance as that 
of ini. In m, a reconstruction method is proposed 
for sparse signals whose sparsity patterns change slowly 
with time. The main idea is to replace Compressive 
Sensing (CS) on the observation y with CS on the 
Least Squares (LS) residuals. LS residuals are calculated 
using the previous estimation of the support. In ll^ . 
a reconstruction method is proposed to recover sparse 
signals with a sparsity pattern that slowly changes over 
time. The main idea is to use Sparse Bayesian Learning 
(SBL) framework. Similar to SBL, a set of hyperparameters 
is defined to control the sparsity of signals. 
The main difference is that the prior for each coefficient 
also involves the coefficients of the adjacent temporal 
observations. In El, a CS algorithm is proposed for 
time-varying sparse signals based on the least-absolute 
shrinkage and selection operator (Lasso). A dynamic 
Lasso algorithm is proposed for the signals with time- 
varying amplitudes and support. 

The rest of the paper is organized as follows: In 
Section II, the basics of Recurrent Neural Networks 
(RNN) with Long Short-Term Memory (LSTM) cells are 
briefly explained. The proposed method and the learning 
algorithm are presented in Section III. Experimental 
results on two real world datasets are presented in Section 
IV. Conclusions and future work directions are discussed 
in Section V. Details of the final gradient expressions 
for the learning section of the proposed method are 
presented in Appendix A. 

II. RNN with LSTM Cells 

The RNN is a type of deep neural networks EH, 
that are “deep” in the temporal dimension. It has been 
used extensively in time sequence modelling Eol, ED, 
ED, E3, E3, Ea, ESI, ED, EH. If we look at the 
sparse vectors (columns) in S as a sequence, the main 
idea of using RNN for the MMV problem is to predict 
the sparsity patterns over different sparse vectors in S. 

Although RNN performs sequence modelling in a 
principled manner, it is generally difficult to learn the 
long term dependency within the sequence due to the 
vanishing gradients problem. One of the effective solutions 
for this problem in RNNs is to employ memory 
cells instead of neurons, as originally proposed in 
ID as Long Short-Term Memory (LSTM). It is further 
developed in and 1401 by adding forget gate and 
peephole connections to the architecture. 

We use the architecture of LSTM illustrated in Fig. 2 
for the proposed sequence modelling method for the 
MMV problem. In this figure, $\mathbf{i}(t)$, $\mathbf{f}(t)$, $\mathbf{o}(t)$, $\mathbf{c}(t)$ 
are input gate, forget gate, output gate and cell state 
vector respectively, $\mathbf{W}_{p1}$, $\mathbf{W}_{p2}$ and $\mathbf{W}_{p3}$ are peephole 
connections, $\mathbf{W}_i$, $\mathbf{W}_{reci}$ and $\mathbf{b}_i$, i = 1,2,3,4 are 
input connections, recurrent connections and bias values, 
respectively, $g(\cdot)$ and $h(\cdot)$ are $\tanh(\cdot)$ functions and $\sigma(\cdot)$ 
is the sigmoid function. We use this architecture to find 
$\mathbf{v}$ for each channel and then use the proposed method in 
Fig. 1 to find the entries that have a higher probability 
of being non-zero. Considering Fig. 2, the forward pass 
for the LSTM model is as follows: 

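$$\begin{aligned}
\mathbf{y}_g(t) &= g(\mathbf{W}_4\mathbf{r}(t) + \mathbf{W}_{rec4}\mathbf{v}(t-1) + \mathbf{b}_4) \\
\mathbf{i}(t) &= \sigma(\mathbf{W}_3\mathbf{r}(t) + \mathbf{W}_{rec3}\mathbf{v}(t-1) + \mathbf{W}_{p3}\mathbf{c}(t-1) + \mathbf{b}_3) \\
\mathbf{f}(t) &= \sigma(\mathbf{W}_2\mathbf{r}(t) + \mathbf{W}_{rec2}\mathbf{v}(t-1) + \mathbf{W}_{p2}\mathbf{c}(t-1) + \mathbf{b}_2) \\
\mathbf{c}(t) &= \mathbf{f}(t) \circ \mathbf{c}(t-1) + \mathbf{i}(t) \circ \mathbf{y}_g(t) \\
\mathbf{o}(t) &= \sigma(\mathbf{W}_1\mathbf{r}(t) + \mathbf{W}_{rec1}\mathbf{v}(t-1) + \mathbf{W}_{p1}\mathbf{c}(t) + \mathbf{b}_1) \\
\mathbf{v}(t) &= \mathbf{o}(t) \circ h(\mathbf{c}(t))
\end{aligned} \tag{6}$$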

where o denotes the Hadamard (element-wise) product. 

Summary of notations used in Fig. 2 is as follows: 

• “t”: Stands for the time index in the sequence. 
For example, if we have 4 residual vectors of four 
different channels, we can show them as $\mathbf{r}(t)$, t = 
1,2,3,4. 

• “1”: is a scalar. 

• “$\mathbf{W}_{reci}$, i = 1,2,3,4”: Recurrent weight matrices 
of dimension ncell x ncell where ncell is the 
number of cells in LSTM. 

• “$\mathbf{W}_i$, i = 1,2,3,4”: Input weight matrices of 
dimension M x ncell where M is the number 
of random measurements in compressive sensing. 
These matrices map the residual vectors to feature 
space. 

• “$\mathbf{b}_i$, i = 1,2,3,4”: Bias vectors of size ncell x 1. 

• “$\mathbf{W}_{pi}$, i = 1,2,3”: Peephole connections of dimension 
ncell x ncell. 

• “$\mathbf{v}(t)$, t = 1,2,...,L”: Output of the cells. Vector 
of size ncell x 1. L is the number of channels in 
the MMV problem. 

• “$\mathbf{i}(t)$, $\mathbf{o}(t)$, $\mathbf{y}_g(t)$, t = 1,2,...,L”: Input gates, 
output gates and inputs before gating respectively. 
Vectors of size ncell x 1. 

• “$g(\cdot)$ and $h(\cdot)$”: tanh(.) function. 

• “$\sigma(\cdot)$”: Sigmoid function. 
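The forward pass in (6) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the helper names `lstm_step` and `init_params`, the parameter dictionary layout and the random initialization scale are assumptions. The input weights are stored with shape (ncell, M) so that the product with an M-dimensional residual is defined.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(m, ncell, rng, scale=0.1):
    """Small random parameters; names mirror the symbols in (6)."""
    p = {}
    for k in range(1, 5):
        p[f"W{k}"] = scale * rng.standard_normal((ncell, m))         # input weights
        p[f"Wrec{k}"] = scale * rng.standard_normal((ncell, ncell))  # recurrent weights
        p[f"b{k}"] = np.zeros(ncell)                                 # biases
    for k in range(1, 4):
        p[f"Wp{k}"] = scale * rng.standard_normal((ncell, ncell))    # peepholes
    return p


def lstm_step(r, v_prev, c_prev, p):
    """One step of (6): residual r(t), previous output v(t-1) and cell state c(t-1)."""
    y_g = np.tanh(p["W4"] @ r + p["Wrec4"] @ v_prev + p["b4"])
    i = sigmoid(p["W3"] @ r + p["Wrec3"] @ v_prev + p["Wp3"] @ c_prev + p["b3"])
    f = sigmoid(p["W2"] @ r + p["Wrec2"] @ v_prev + p["Wp2"] @ c_prev + p["b2"])
    c = f * c_prev + i * y_g                      # Hadamard products
    o = sigmoid(p["W1"] @ r + p["Wrec1"] @ v_prev + p["Wp1"] @ c + p["b1"])
    v = o * np.tanh(c)
    return v, c


# Usage: unfold over L = 4 channels, starting from a zero state.
rng = np.random.default_rng(0)
params = init_params(m=6, ncell=5, rng=rng)
v, c = np.zeros(5), np.zeros(5)
for t in range(4):
    residual = rng.standard_normal(6)  # placeholder for r(t)
    v, c = lstm_step(residual, v, c, params)
```

Note that the output gate uses the updated cell state c(t) in its peephole term, while the input and forget gates use c(t-1), as in (6).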

III. Proposed Method 
A. High Level Picture 

The summary of the proposed method is presented 
in Fig. 1. We initialize the residual vector, $\mathbf{r}$, for each 
channel by the measurement vector, $\mathbf{y}$, of that channel. 
These residual vectors serve as the input to the LSTM 
model that captures features of the residual vectors using 
input weight matrices ($\mathbf{W}_1, \mathbf{W}_2, \mathbf{W}_3, \mathbf{W}_4$) as well as the 
dependency among the residual vectors using recurrent 
weight matrices ($\mathbf{W}_{rec1}, \mathbf{W}_{rec2}, \mathbf{W}_{rec3}, \mathbf{W}_{rec4}$) and the 
central memory unit shown in Fig. 2. A transformation 
matrix $\mathbf{U}$ is then used to transform $\mathbf{v} \in \Re^{ncell \times 1}$, the 
output of each memory cell after gating, into the sparse 
vectors space, i.e., $\mathbf{z} \in \Re^{N \times 1}$. “ncell” is the number of 
cells in the LSTM model. Then a softmax layer is used 
for each channel to find the probability of each entry 
of each sparse vector being non-zero. For example, for 
channel 1, the j-th output of the softmax layer is: 

$$P(\mathbf{s}_1(j) \mid \mathbf{r}_1) = \frac{e^{\mathbf{z}(j)}}{\sum_{k=1}^{N} e^{\mathbf{z}(k)}} \tag{7}$$

Then for each channel, the entry with the maximum 
probability value is selected and added to the support 
set of that channel. After that, given the new support 
set, the following least squares problem is solved to find 
an estimate of the sparse vector for the j-th channel: 

$$\hat{\mathbf{s}}_j = \arg\min_{\mathbf{s}_j} \|\mathbf{y}_j - \mathbf{A}^{\Omega_j}\mathbf{s}_j\|_2^2 \tag{8}$$

Using $\hat{\mathbf{s}}_j$, the new residual value for the j-th channel is 
calculated as follows: 

$$\mathbf{r}_j = \mathbf{y}_j - \mathbf{A}^{\Omega_j}\hat{\mathbf{s}}_j \tag{9}$$

This residual serves as the input to the LSTM model at 
the next iteration of the algorithm. The algorithm stops 
when the residual values are small 
enough or when it has performed N iterations, where N 
is the dimension of the sparse vector. Since we have used 
LSTM cells for the proposed method, we call it the LSTM-CS 
algorithm. The pseudo-code of the proposed method 
is presented in Algorithm 1. 


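Algorithm 1 Distributed Compressive Sensing using Long Short-Term Memory (LSTM-CS) 

Inputs: CS measurement matrix $\mathbf{A} \in \Re^{M \times N}$; matrix of measurements $\mathbf{Y} \in \Re^{M \times L}$; 
minimum $\ell_2$ norm of residual matrix “resMin” as stopping criterion; trained “lstm” model 
Output: Matrix of sparse vectors $\hat{\mathbf{S}} \in \Re^{N \times L}$ 
Initialization: $\hat{\mathbf{S}} = 0$; j = 1; i = 1; $\Omega = \emptyset$; $\mathbf{R} = \mathbf{Y}$. 

procedure LSTM-CS(A, Y, lstm) 
    while i ≤ N and $\|\mathbf{R}\|_2$ > resMin do 
        i ← i + 1 
        for j = 1 → L do 
            $\mathbf{R}(:,j)_i$ ← $\mathbf{R}(:,j)_{i-1} / \max(|\mathbf{R}(:,j)_{i-1}|)$ 
            $\mathbf{v}_j$ ← lstm($\mathbf{R}(:,j)_i$, $\mathbf{v}_{j-1}$, $\mathbf{c}_{j-1}$)    ▷ LSTM 
            $\mathbf{z}_j$ ← $\mathbf{U}\mathbf{v}_j$ 
            c ← softmax($\mathbf{z}_j$) 
            idx ← Support(max(c)) 
            $\Omega_i$ ← $\Omega_{i-1}$ ∪ idx 
            $\hat{\mathbf{S}}^{\Omega_i}(:,j)$ ← $(\mathbf{A}^{\Omega_i})^{\dagger}\mathbf{Y}(:,j)$    ▷ Least Squares 
            $\hat{\mathbf{S}}^{\Omega_i^C}(:,j)$ ← 0 
            $\mathbf{R}(:,j)_i$ ← $\mathbf{Y}(:,j) - \mathbf{A}^{\Omega_i}\hat{\mathbf{S}}^{\Omega_i}(:,j)$ 
        end for 
    end while 
end procedure 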
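The decoder loop of Algorithm 1 can be sketched as follows. This is a hedged illustration, not the authors' code: the function `greedy_mmv_decode` and its `score_fn` argument are hypothetical names, and because no trained LSTM is available here, the usage example plugs in a simple correlation-based scorer (`correlation_scorer`, an OMP-like stand-in) in place of the LSTM and softmax layers. The Frobenius norm of the residual matrix is assumed as the stopping measure.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def greedy_mmv_decode(A, Y, score_fn, res_min=1e-8, max_iter=None):
    """Two-step greedy decoder: pick the most probable atom, then least squares."""
    n = A.shape[1]
    n_channels = Y.shape[1]
    max_iter = n if max_iter is None else max_iter
    S_hat = np.zeros((n, n_channels))
    R = Y.astype(float).copy()
    supports = [[] for _ in range(n_channels)]
    it = 0
    while it < max_iter and np.linalg.norm(R) > res_min:
        it += 1
        state = None  # recurrent state carried across channels
        for j in range(n_channels):
            peak = np.max(np.abs(R[:, j]))
            if peak <= res_min:
                continue  # this channel is already explained
            # Step 1: probabilities of each entry being non-zero.
            probs, state = score_fn(R[:, j] / peak, state)
            idx = int(np.argmax(probs))
            if idx not in supports[j]:
                supports[j].append(idx)
            # Step 2: least squares on the current support, as in (8).
            cols = supports[j]
            coef, *_ = np.linalg.lstsq(A[:, cols], Y[:, j], rcond=None)
            S_hat[:, j] = 0.0
            S_hat[cols, j] = coef
            R[:, j] = Y[:, j] - A[:, cols] @ coef  # residual update, as in (9)
    return S_hat


def correlation_scorer(A):
    """Stand-in for the trained model: softmax over |A^T r| (not an LSTM)."""
    def score(r, state):
        return softmax(np.abs(A.T @ r)), state
    return score


# Toy usage with an identity dictionary, where recovery is exact.
A = np.eye(4)
S = np.zeros((4, 2))
S[1, 0], S[3, 1] = 2.0, -1.5
S_hat = greedy_mmv_decode(A, A @ S, correlation_scorer(A))
```

Replacing `correlation_scorer` with a function that runs the trained LSTM, multiplies by U and applies the softmax would give the structure described in Algorithm 1.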



We continue by explaining how the training data is 
prepared from an off-line dataset and then we present the 
details of the learning method. Please note that all the 
computations explained in the subsequent two sections 
are performed only once and they do not affect the run 
time of the proposed solver in Fig. 1. It is almost as fast 
as greedy algorithms in sparse reconstruction. 
B. Training Data Generation 

The main idea of the proposed method is to look at 
the sparse reconstruction problem as a two step task; 
a classification as the first step and a subsequent least 
squares as the second step. In the classification step, 
the aim is to find the atom of the dictionary, i.e., the 
column of A, that is most relevant to the given residual 
of the current channel and the residuals of the previous 
channels. Therefore we need a set of residual vectors 
and their corresponding sparse vectors for supervised 
training. Since the training data and A are given, we 
can imitate the steps explained in the previous section 
to generate the residuals. This means that, given a sparse 
vector $\mathbf{s}$ with k non-zero entries, we calculate $\mathbf{y}$ using 
(3). Then we find the entry that has the maximum value 
in $\mathbf{s}$ and set it to zero. Assume that the index of this 
entry is $k_0$. This gives us a new sparse vector with k-1 
non-zero entries. Then we calculate the residual vector 
from: 

$$\mathbf{r} = \mathbf{y} - \mathbf{a}_{k_0}\mathbf{s}(k_0) \tag{10}$$

where $\mathbf{a}_{k_0}$ is the $k_0$-th column of A and $\mathbf{s}(k_0)$ is the 
$k_0$-th entry of $\mathbf{s}$. This residual is non-zero 
because the remaining k-1 non-zero 
entries of $\mathbf{s}$ are not yet accounted for. Among these remaining k-1 non-zero entries, 
the second largest value of $\mathbf{s}$ has the main contribution 
to $\mathbf{r}$ in (10). Therefore, we use $\mathbf{r}$ to predict the location 
of the second largest value of $\mathbf{s}$. Assume that the index 
of the second largest value of $\mathbf{s}$ is $k_1$. We define $\mathbf{s}_0$ as a 
one hot vector that has value 1 at the $k_1$-th entry and zero 
at other entries. Therefore, the training pair is $(\mathbf{r}, \mathbf{s}_0)$. 

Now we set the $k_1$-th entry of $\mathbf{s}$ to zero. This gives us 
a new sparse vector with k-2 non-zero entries. Then 
we calculate the new residual vector from: 

$$\mathbf{r} = \mathbf{y} - [\mathbf{a}_{k_0}, \mathbf{a}_{k_1}][\mathbf{s}(k_0), \mathbf{s}(k_1)]^T \tag{11}$$

We use the residual in (11) to predict the location of the 
third largest value in $\mathbf{s}$. Assume that the index of the 
third largest value of $\mathbf{s}$ is $k_2$. We define $\mathbf{s}_0$ as a one hot 
vector that has value 1 at the $k_2$-th entry and zero at other 
entries. Therefore, the new training pair is $(\mathbf{r}, \mathbf{s}_0)$. 

The above procedure is continued up to the point that 
s does not have any non-zero entry. Then the same 
procedure is used for the next training sample. This 
gives us training samples for one channel. Then the same 
procedure is used for the next channel in S. Since the 
number of non-zero entries, k, is not known in advance, 
we assume a maximum number of non-zero entries per 
channel for training data generation. 
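The pair generation procedure above can be sketched for a single sparse vector as follows. This is an illustrative sketch under two assumptions: entries are ranked by magnitude (the text speaks of the "maximum value", which coincides with magnitude for non-negative data such as pixel intensities), and generation stops once no further non-zero entry remains to be predicted. The function name `training_pairs` is hypothetical.

```python
import numpy as np


def training_pairs(A, s):
    """Return a list of (residual, one-hot target) pairs for one sparse vector s."""
    y = A @ s
    # Indices of the non-zero entries, largest magnitude first: k_0, k_1, ...
    order = [int(i) for i in np.argsort(-np.abs(s)) if s[i] != 0]
    pairs = []
    for t in range(len(order) - 1):
        removed = order[: t + 1]
        # Residual after removing the contributions of the t+1 largest entries,
        # as in (10) and (11).
        r = y - A[:, removed] @ s[removed]
        target = np.zeros(s.shape[0])
        target[order[t + 1]] = 1.0  # location of the next largest entry
        pairs.append((r, target))
    return pairs


# Toy usage: identity dictionary and three non-zero entries.
A = np.eye(4)
s = np.array([3.0, 0.0, -2.0, 1.0])
for r, s0 in training_pairs(A, s):
    print(r, int(np.argmax(s0)))
```

For a vector with k non-zero entries this yields k-1 pairs; repeating it per channel and per training sample gives the sequences used for training.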


C. Learning Method 

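To calculate the parameters of the proposed model, 
i.e., $\mathbf{W}_1, \mathbf{W}_2, \mathbf{W}_3, \mathbf{W}_4$, $\mathbf{W}_{rec1}, \mathbf{W}_{rec2}, \mathbf{W}_{rec3}, \mathbf{W}_{rec4}$, 
$\mathbf{W}_{p1}, \mathbf{W}_{p2}, \mathbf{W}_{p3}$, $\mathbf{b}_1, \mathbf{b}_2, \mathbf{b}_3, \mathbf{b}_4$ in Fig. 2 and transformation 
matrix $\mathbf{U}$ in Fig. 1, we minimize a cross entropy 
cost function over the training data. Assuming $\mathbf{s}$ is the 
output vector of the softmax layer given by the model 
in Fig. 1 (output of the softmax layer is represented as 
conditional probabilities in Fig. 1) and $\mathbf{s}_0$ is the one hot 
vector explained in the previous section, the following 
optimization problem is solved: 

$$\hat{\Lambda} = \arg\min_{\Lambda} \left\{ \sum_{i=1}^{nB} \sum_{r=1}^{Bsize} \sum_{\tau=1}^{L} \sum_{j=1}^{N} L_{r,i,\tau,j}(\Lambda) \right\}, \qquad L_{r,i,\tau,j}(\Lambda) = -\mathbf{s}_{0,r,i,\tau}(j)\,\log(\mathbf{s}_{r,i,\tau}(j)) \tag{12}$$

where nB is the number of mini-batches in the training 
data, Bsize is the number of training data pairs, $(\mathbf{r}, \mathbf{s}_0)$, 
in each mini-batch, L is the number of channels in the 
MMV problem, i.e., number of columns of S, and N is 
the length of vectors $\mathbf{s}$ and $\mathbf{s}_0$. $\Lambda$ denotes the collection 
of the model parameters that includes $\mathbf{W}_1$, $\mathbf{W}_2$, $\mathbf{W}_3$, 
$\mathbf{W}_4$, $\mathbf{W}_{rec1}$, $\mathbf{W}_{rec2}$, $\mathbf{W}_{rec3}$, $\mathbf{W}_{rec4}$, $\mathbf{W}_{p1}$, $\mathbf{W}_{p2}$, $\mathbf{W}_{p3}$, 
$\mathbf{b}_1$, $\mathbf{b}_2$, $\mathbf{b}_3$ and $\mathbf{b}_4$ in Fig. 2 and $\mathbf{U}$ in Fig. 1. 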
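As a small worked illustration of the cost in (12), the sketch below evaluates the softmax cross entropy for one training sequence, i.e., the inner sums over channels and entries. The function names and the small constant `eps` (added for numerical safety) are assumptions, not part of the original method.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()


def cross_entropy(z, s0, eps=1e-12):
    """-sum_j s0(j) log s(j), where s = softmax(z) and s0 is a one-hot target."""
    s = softmax(z)
    return -np.sum(s0 * np.log(s + eps))


def sequence_loss(Z, S0):
    """Sum of the per-channel losses; Z and S0 are N x L (one column per channel)."""
    return sum(cross_entropy(Z[:, t], S0[:, t]) for t in range(Z.shape[1]))


# Uniform scores over N = 4 entries give a loss of log(4) per channel.
Z = np.zeros((4, 2))
S0 = np.zeros((4, 2))
S0[2, 0] = 1.0
S0[0, 1] = 1.0
print(sequence_loss(Z, S0), 2 * np.log(4))
```

Summing `sequence_loss` over all pairs in a mini-batch and over all mini-batches gives the full cost in (12).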

To solve the optimization problem in (12), we use 
Backpropagation through time (BPTT) with the Nesterov 
method. The update equations for parameter $\Lambda$ at epoch 
k are as follows: 

$$\begin{aligned}
\Delta\Lambda_k &= \Lambda_k - \Lambda_{k-1} \\
\Delta\Lambda_k &= \mu_{k-1}\Delta\Lambda_{k-1} - \epsilon_{k-1}\nabla L(\Lambda_{k-1} + \mu_{k-1}\Delta\Lambda_{k-1})
\end{aligned} \tag{13}$$

where $\nabla L(\cdot)$ is the gradient of the cost function in (12), 
$\epsilon$ is the learning rate and $\mu_k$ is a momentum parameter 
determined by the scheduling scheme used for training. 
The above equations are equivalent to the Nesterov method in 
El. To see why, please refer to Appendix A.1 of ll 4 ^ 
where the Nesterov method is derived as a momentum 
method. The gradient of the cost function, $\nabla L(\Lambda)$, is: 


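$$\nabla L(\Lambda) = \sum_{i=1}^{nB} \sum_{r=1}^{Bsize} \underbrace{\sum_{\tau=1}^{L} \sum_{j=1}^{N} \frac{\partial L_{r,i,\tau,j}(\Lambda)}{\partial \Lambda}}_{\text{one large update}} \tag{14}$$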

As is obvious from (14), since we have unfolded the 
LSTM over channels in S, we fold it back when we 
want to calculate gradients over the whole sequence of 
channels. 

The error signals for different 
parameters of the proposed model that are necessary for 
training are presented in Appendix A. Due to lack of 
space, we omit the presentation of the full derivation of the 
gradients. 

We have used mini-batch training to accelerate training 
and one large update instead of incremental updates 
during back propagation through time. To resolve 
the gradient explosion problem we have used gradient 
clipping. To accelerate the convergence, we have used 
Nesterov method ED and found it effective in training 
the proposed model for the MMV problem. 

Algorithm 2 Training the proposed model for Distributed 
Compressive Sensing 

Inputs: Fixed step size “$\epsilon$”, Scheduling for “$\mu$”, Gradient clip threshold “$th_G$”, 
Maximum number of Epochs “nEpoch”, Total number of training 
pairs in each mini-batch “Bsize”, Number of channels for the MMV 
problem “L”. 
Outputs: LSTM-CS trained model for distributed compressive sensing “$\Lambda$”. 
Initialization: Set all parameters in $\Lambda$ to small random numbers, i = 0, 
k = 1. 

procedure LSTM-CS($\Lambda$) 
    while i < nEpoch do 
        for “first minibatch” → “last minibatch” do 
            r ← 1 
            while r ≤ Bsize do 
                Compute the gradient of the cost for training pair r, summed over the L channels    ▷ use (23) onward in Appendix A 
                r ← r + 1 
            end while 
            Compute $\nabla L(\Lambda_k)$ ← “sum above terms over r” 
            if $\nabla L(\Lambda_k)$ > $th_G$ then 
                $\nabla L(\Lambda_k)$ ← $th_G$    ▷ For each entry of the gradient matrix $\nabla L(\Lambda_k)$ 
            end if 
            Compute $\Delta\Lambda_k$    ▷ use (13) 
            Update: $\Lambda_k$ ← $\Delta\Lambda_k + \Lambda_{k-1}$ 
            k ← k + 1 
        end for 
        i ← i + 1 
    end while 
end procedure 

We have used a simple yet effective scheduling for 
$\mu_k$ in (13): in the first and last 10% of all parameter updates 
$\mu_k = 0.9$ and for the other 80% of all parameter updates 
$\mu_k = 0.995$. We have used a fixed step size for training 
LSTM. Please note that since we are using mini-batch 
training, all parameters are updated for each mini-batch 
in (14). 
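The update in (13), the momentum schedule and the gradient clipping can be sketched together as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, the gradient is supplied by a user-provided `grad_fn`, and clipping is applied symmetrically to each entry (the pseudo-code in Algorithm 2 only states the upper threshold).

```python
import numpy as np


def momentum_schedule(k, total_updates):
    """0.9 in the first and last 10% of the updates, 0.995 in between."""
    frac = k / total_updates
    return 0.9 if frac < 0.1 or frac >= 0.9 else 0.995


def nesterov_step(params, delta, grad_fn, lr, mu, clip):
    """One update of (13): gradient evaluated at the look-ahead point."""
    lookahead = params + mu * delta
    grad = np.clip(grad_fn(lookahead), -clip, clip)  # entry-wise clipping
    delta = mu * delta - lr * grad
    return params + delta, delta


# Toy usage on L(p) = ||p||^2, whose gradient is 2p.
params = np.array([1.0, -2.0])
delta = np.zeros_like(params)
total = 100
for k in range(total):
    mu = momentum_schedule(k, total)
    params, delta = nesterov_step(params, delta, lambda p: 2.0 * p,
                                  lr=0.1, mu=mu, clip=10.0)
```

In actual training, `grad_fn` would return the mini-batch gradient in (14) obtained by back propagation through time.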

A summary of the training method for LSTM-CS is 
presented in Algorithm 2. 

Although the training method and derivatives in Appendix A 
are presented for all parameters in LSTM, in 
the implementation, we have removed peephole connections 
and forget gates. Since the length of each sequence, 
i.e., the number of columns in S, is known in advance, 
we set the state of each cell to zero at the beginning of a 
new sequence. Therefore, forget gates are not a great 
help here. Also, as long as the order of columns in S 
is kept, the precise timing in the sequence is not of 
great concern; therefore, peephole connections are not 
that important either. Removing peephole connections 
and forget gates also reduces training time, 
since fewer parameters need to be tuned during 
training. 

IV. Experimental Results and Discussion 

We have performed the experiments on two real world 
datasets, the first is the MNIST dataset of handwritten 
digits 1431 and the second is three different classes of 


images from natural image dataset of Microsoft Research 
in Cambridge Ell. 

In this section, we would like to answer the following 
questions: (i) How is the performance of different recon¬ 
struction algorithms for the MMV problem, including the 
proposed method, when different channels, i.e., different 
columns in S, have different sparsity patterns? (ii) Does 
the proposed method perform well enough when there 
is correlation among different sparse vectors? E.g., when 
sparse vectors are DCT or Wavelet transform of different 
blocks of an image? (iii) How fast is the proposed 
method compared to other reconstruction algorithms for 
the MMV problem? (iv) How robust is the proposed 
method to noise? 

For all the results presented in this section, the reconstruction 
error is defined as: 

NMSE = (15) 

where $\mathbf{S}$ is the actual sparse matrix and $\hat{\mathbf{S}}$ is the recovered 
sparse matrix from random measurements by the 
reconstruction algorithm. The machine used to perform 
the experiments has an Intel(R) Core(TM) i7 CPU with 
clock 2.93 GHz and with 16 GB RAM. 

A. MNIST Dataset 

MNIST is a dataset of handwritten digits where the 
images of the digits are normalized in size and centred so 
that we have fixed size images. The task is to simultaneously 
encode 4 images each of size 24 x 24, i.e., we have 
4 channels and L = 4 in (4). The encoder is a typical 
compressive sensing encoder, i.e., a randomly generated 
matrix A. We have normalized each column of A to 
have unit norm. Since the images are already sparse, 
i.e., have only a few non-zero pixels, no transform, 
$\mathbf{\Psi}$ in (2), is used. To simulate the measurement noise, 
we have added Gaussian noise with standard deviation 
0.005 to the measurement matrix Y in (4). This results 
in measurements with signal to noise ratio (SNR) of 
approximately 46 dB. We have divided each image into 
four 12 x 12 blocks. This means that the length of 
each sparse vector is N = 144. We have taken 50% 
random measurements from each sparse vector, i.e., 
M = 72. After receiving and reconstructing all blocks at 
the decoder, we compute the reconstruction error defined 
in (15) for the full image. We have randomly selected 
10 images for each digit from the set {0, 1, 2, 3}, i.e., 
40 images in total for the test. This means that the first 
column of S is an image of digit 0, the second column 
is an image of digit 1, the third column is an image of 
digit 2 and the fourth column is an image of digit 3. Test 
images are represented in Fig. 3. 
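The encoder side of this setup can be sketched as follows. This is an illustrative sketch, not the authors' code: the helper names are hypothetical, random arrays stand in for the MNIST digits, and a Gaussian distribution is assumed for the random matrix A, since the text only states that A is randomly generated with unit-norm columns.

```python
import numpy as np


def blocks_to_mmv(images, block=12):
    """images: (L, H, W). Returns one N x L matrix S per block position,
    where column j holds the vectorized block of image (channel) j."""
    n_channels, height, width = images.shape
    problems = []
    for r in range(0, height, block):
        for c in range(0, width, block):
            S = images[:, r:r + block, c:c + block].reshape(n_channels, -1).T
            problems.append(S)
    return problems


def make_sensing_matrix(m, n, rng):
    """Random matrix with unit-norm columns (Gaussian entries assumed)."""
    A = rng.standard_normal((m, n))
    return A / np.linalg.norm(A, axis=0, keepdims=True)


rng = np.random.default_rng(0)
images = rng.random((4, 24, 24))        # placeholder for digits 0, 1, 2, 3
problems = blocks_to_mmv(images)        # four MMV problems, each 144 x 4
A = make_sensing_matrix(72, 144, rng)   # M = 72, N = 144
noise_std = 0.005
Y_list = [A @ S + noise_std * rng.standard_normal((72, 4)) for S in problems]
```

Each matrix in `Y_list` is one instance of (4) with L = 4; the decoder reconstructs the four block-wise problems and the error is then computed on the reassembled full images.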
Fig. 3. Randomly selected images for test from MNIST dataset. The 
first channel encodes digit zero, the second channel encodes digit one 
and so on. 

We have compared the performance of the proposed
reconstruction algorithm (LSTM-CS) with 7 reconstruction
methods for the MMV problem. These methods are:

• Simultaneous Orthogonal Matching Pursuit 
(SOMP) which is a well known baseline for the 
MMV problem. 

• Bayesian Compressive Sensing (BCS) applied
independently on each channel. For the BCS
method we set the initial noise variance of the i-th
channel to the value suggested by the authors, i.e.,
$\mathrm{std}(\mathbf{y}_i)^2/100$, where $i \in \{1, 2, 3, 4\}$ and std(·)
calculates the standard deviation. We set the threshold
for stopping the algorithm to $10^{-8}$.

• Multitask Compressive Sensing (MT-BCS),
which takes into account the statistical dependency
of different channels. For MT-BCS we set the
parameters of the Gamma prior on the noise variance
to a = 100/0.1 and b = 1, which are the values
suggested by the authors. We set the stopping
threshold to $10^{-8}$ as well.

• Sparse Bayesian Learning for Temporally correlated
sources (T-SBL), which exploits correlation
among different sources in the MMV problem. For
T-SBL, we used the default values proposed by the
authors.

• Nonlinear Weighted SOMP (NWSOMP), which
solves a regression problem to help the SOMP
algorithm with prior knowledge from training data.
For NWSOMP, during training, we used one layer,
512 neurons and 25 epochs of parameter updates.

• Compressive Sensing on Least Squares Residual
(LSCS), where no explicit joint sparsity assumption
is made in the design of the method. For
LSCS, we used sigma0 = cc*(1/3)*sqrt(Sav/m), as
suggested by the author, where m is the number of
measurements and Sav = 16. We tried a range of
values for cc and obtained the best results with
cc = 0.1. We also set sigsys = 1, siginit = 3 and
lambdap = 4, as suggested by the author.

• The method referred to as PCSBL-GAMP, where
sparse Bayesian learning is used to design the method
and no explicit joint sparsity assumption is made.
For PCSBL-GAMP, we used beta = 1, Pattern = 2
(because we need the coupling among the sparse
vectors, i.e., left and right coupling), a maximum
number of iterations maxiter = 400, and C = 1e0, as
suggested by the authors for the noisy case.

Fig. 4. Comparison of different MMV reconstruction algorithms for
MNIST dataset. Bottom figure is the same as top figure without results
of BCS algorithm to make the difference among different algorithms
more visible. Horizontal axis: number of non-zero entries in the sparse
vector. In this experiment M = 72 and N = 144.
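For readers unfamiliar with the SOMP baseline listed above, the following is a minimal textbook-style sketch of Simultaneous Orthogonal Matching Pursuit. It is not the code used in the experiments; the function name `somp`, the stopping rule (a fixed number of iterations k) and the toy data are assumptions for illustration. The toy signal is deliberately jointly sparse, which is the setting SOMP is designed for.

```python
import numpy as np

def somp(A, Y, k):
    """Greedy recovery of S (N x L) assuming at most k shared non-zero rows."""
    N = A.shape[1]
    residual = Y.copy()
    support = []
    coef = np.zeros((0, Y.shape[1]))
    for _ in range(k):
        # Score each atom by its absolute correlation summed over channels.
        scores = np.abs(A.T @ residual).sum(axis=1)
        if support:
            scores[support] = -np.inf   # never pick the same atom twice
        support.append(int(np.argmax(scores)))
        # Least-squares fit of all channels on the atoms selected so far.
        coef, *_ = np.linalg.lstsq(A[:, support], Y, rcond=None)
        residual = Y - A[:, support] @ coef
    S_hat = np.zeros((N, Y.shape[1]))
    if support:
        S_hat[support, :] = coef
    return S_hat, sorted(support)

# Toy jointly sparse problem (synthetic data, noiseless).
rng = np.random.default_rng(0)
M, N, L, k = 72, 144, 4, 4
A = rng.standard_normal((M, N))
A /= np.linalg.norm(A, axis=0, keepdims=True)
true_support = sorted(rng.choice(N, size=k, replace=False).tolist())
S = np.zeros((N, L))
S[true_support, :] = 1.0 + rng.random((k, L))
Y = A @ S
S_hat, est_support = somp(A, Y, k)
```

Because every channel shares one support in this sketch, the atom selection pools evidence across channels. When the supports differ between channels, as in the experiments reported here, that pooling is exactly what limits SOMP.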

For LSTM-CS, during training, we used one layer,
512 cells and 25 epochs of parameter updates. We used
only 200 images for the training set, which does not
include any of the 40 images used for test. To monitor
and prevent overfitting, we used 3 images per channel as
the validation set and applied early stopping when
necessary. The images used for validation were not used
in the training set or in the test set. Results are
presented in Fig. 4.

In Fig. 4, the vertical axis is the NMSE defined in (15)
and the horizontal axis is the number of non-zero entries
in the sparse vector. The number of measurements, M,
is fixed to 72. Each point on the curves in Fig. 4 is the
NMSE averaged over the 40 reconstructed test images at
the decoder.

For the MNIST dataset, we observe from Fig. 4 that
LSTM-CS significantly outperforms the reconstruction
algorithms for the MMV problem discussed in this paper.
One important reason for this is that existing MMV
solvers rely on the joint sparsity in S, while the proposed
method does not rely on this assumption. Another reason
is that the structure of each sparse vector is effectively
captured by the LSTM. The images reconstructed by the
different MMV reconstruction algorithms for 4 test images
are presented in Fig. 5. An interesting observation from
Fig. 5 is that the accuracy of reconstruction depends on
the complexity of the sparsity pattern. For example, when
the sparsity pattern is simple, e.g., the image of digit 1 in
Fig. 5, all the algorithms perform well. But when the
sparsity pattern is more complex, e.g., the image of digit
0 in Fig. 5, their reconstruction accuracy degrades
significantly.

Fig. 5. Reconstructed images using different MMV reconstruction algorithms for 4 images of the MNIST dataset. First row are original images,
S, second row are measurement matrices, Y, third row are reconstructed images using LS-CS, fourth row are reconstructed images using
SOMP, fifth row using PCSBL-GAMP, sixth row using MT-BCS, seventh row using T-SBL, eighth row using NWSOMP and the last row are
reconstructed images using the proposed LSTM-CS method.

We have repeated the experiments on the MNIST
dataset with 25% random measurements, i.e., M = 36.
The results are presented in Fig. 6. We trained 4 different
LSTM models for this experiment. The first one is the
same model used for the previous experiment (M = 72).
In the second model, we increased the number of cells
in the LSTM model from 512 to 1024. In the third and
fourth models, we used 2 times and 4 times more training
data, respectively. The remaining experimental settings
were the same as those described before. As observed
from these results, investing more in training the LSTM
model (more cells or more training data) improves the
performance of the LSTM-CS method.

All the results presented so far are for noisy measurements
where additive Gaussian noise with standard
deviation 0.005 is used (SNR ≈ 46 dB). To evaluate the
stability of the proposed LSTM-CS method to noise, and
compare it with the other methods discussed in this paper,
an experiment was performed using the following range
of noise standard deviations:

$$\sigma = \{0.5, 0.2, 0.1, 0.05, 0.01, 0.005\} \tag{16}$$

where $\sigma$ is the standard deviation of the noise. This
approximately corresponds to:

$$\mathrm{SNR} = \{6\ \mathrm{dB}, 14\ \mathrm{dB}, 20\ \mathrm{dB}, 26\ \mathrm{dB}, 40\ \mathrm{dB}, 46\ \mathrm{dB}\} \tag{17}$$

We used the same experimental settings explained above.
Results are presented in Fig. 7.
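The correspondence between (16) and (17) can be checked with a short calculation. The sketch below assumes that the signal has unit root-mean-square amplitude, so that the SNR in dB is $20 \log_{10}(1/\sigma)$. This assumption is inferred from the reported pairs of values and is not stated in the text; the function name `snr_db` is hypothetical.

```python
import math

def snr_db(noise_std, signal_rms=1.0):
    """SNR in dB for a given noise standard deviation (unit signal RMS assumed)."""
    return 20.0 * math.log10(signal_rms / noise_std)

sigmas = [0.5, 0.2, 0.1, 0.05, 0.01, 0.005]
snrs = [round(snr_db(s)) for s in sigmas]
print(snrs)  # [6, 14, 20, 26, 40, 46]
```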

As observed from the results, in a very noisy environment,
i.e., SNR = 6 dB, the performance of MT-BCS,
LSCS and PCSBL-GAMP degrades significantly, while
T-SBL, NWSOMP and LSTM-CS (proposed in this
paper) show less severe degradation. In a very
low noise environment, i.e., SNR = 46 dB, the performance
of LSTM-CS, trained with just 512 cells and 200 training
images, is better than that of the other methods. In a medium
noise environment, i.e., SNR = 20 dB and SNR = 26 dB,
the performances of LSTM-CS, T-SBL and PCSBL-GAMP
are close (although LSTM-CS is slightly better). Please
note that the performance of LSTM-CS can be further
improved by using a better architecture (e.g., more
cells, more training data or more layers), as explained
previously.

Fig. 6. Comparison of different MMV reconstruction algorithms for
MNIST dataset. Bottom figure is the same as top figure without results
of BCS algorithm to make the difference among different algorithms
more visible. In this experiment M = 36 and N = 144.

Fig. 7. Reconstruction performance of the methods discussed in the paper for different noise levels.
(a) Results for all methods. (b) Results without the BCS method, for clearer visibility.

To present the phase transition diagram of the solvers, we
used a simple LSTM-CS solver with 512 cells and
just 200 training images. The performance was evaluated
over the following values of m/n, where n is the number
of entries in each sparse vector and m is the number of
measurements per channel:

$$\frac{m}{n} = \{0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50\} \tag{18}$$

For this experiment, we randomly selected 50 images
per channel from the MNIST dataset. Since we have L = 4
channels, each image is of size 24 × 24, and each
image has 4 blocks of 12 × 12 pixels, in total we have
50 × 4 × 4 = 800 sparse vectors. In the NMSE results
reported above, the NMSE of most solvers is about 0.6.
Therefore we set the following condition for perfect
recovery: a setting is counted as perfectly recovered if
more than 90% of the test images are reconstructed with an
NMSE of 0.6 or less. We did this for each m/n in (18).
Results are presented in Fig. 8 and show the
reconstruction performance improvement when the LSTM-CS
method is used.
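The perfect-recovery criterion used for the phase transition diagram can be written as a small helper. This is a sketch of the rule as stated in the text (more than 90% of the test samples with an NMSE of 0.6 or less); the function name `perfect_recovery` and the example NMSE values are illustrative assumptions, not data from the experiments.

```python
import numpy as np

def perfect_recovery(nmse_values, nmse_threshold=0.6, success_rate=0.9):
    """Return (flag, fraction): flag is True when more than `success_rate`
    of the samples have NMSE <= `nmse_threshold`."""
    nmse_values = np.asarray(nmse_values, dtype=float)
    fraction_ok = float(np.mean(nmse_values <= nmse_threshold))
    return fraction_ok > success_rate, fraction_ok

# Made-up NMSE values: 19 of 20 samples meet the threshold in the first
# case, 17 of 20 in the second.
good_case = [0.3] * 19 + [0.8]
bad_case = [0.3] * 17 + [0.8] * 3
flag_good, frac_good = perfect_recovery(good_case)
flag_bad, frac_bad = perfect_recovery(bad_case)
```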

We also present the performance of LSTM-CS for
different numbers of random measurements. We used the
set of random measurements in (18) with n = 144.
We used an LSTM with 512 cells and 400 training
images. The settings for all other methods were the same
as those described before. Results are presented in Fig. 9.

As observed from Fig. 9, using the LSTM-CS method
improves the reconstruction performance compared to the
other methods discussed in this paper.


B. Natural Images Dataset 

For experiments on natural images we used the MSR
Cambridge dataset. Ten randomly selected test
images belonging to three classes of this dataset are used
for the experiments. The images are shown in Fig. 10. We
used 64 × 64 images, each divided into 8 × 8 blocks.
After reconstructing all blocks of an image
in the decoder, the NMSE for the reconstructed image
is calculated. The task is to simultaneously encode 4
blocks (L = 4) of an image and reconstruct them in the
decoder. This means that S has 4 columns, each
having N = 64 entries. We used 50% measurements,
i.e., Y has 4 columns, each having M = 32
entries.

Fig. 8. Phase transition diagram for different methods on MNIST
dataset where 90% perfect recovery is considered. Assuming a perfect
recovery condition of NMSE ≤ 0.6 for this dataset, “n” is the
number of entries in each sparse vector, “m” is the number of random
measurements and “k” is the number of non-zero entries in each sparse
vector.

Fig. 9. Comparison of different MMV reconstruction algorithms for different number of random measurements for MNIST dataset. In this
experiment n = 144. (a) Results for all methods. (b) Results without the BCS method, for clearer visibility. Horizontal axis: number of random measurements (m).
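The block layout described for the natural images can be sketched as follows. The text does not say which 4 blocks are encoded together, so grouping consecutive blocks in raster order is an assumption, as are the helper names `image_to_blocks` and `group_channels`. The sparsifying transform (DCT or wavelet) is left out of this sketch.

```python
import numpy as np

def image_to_blocks(image, block=8):
    """Split an H x W image into non-overlapping blocks, one vectorized
    block per column. Returns an array of shape (block*block, n_blocks)."""
    H, W = image.shape
    assert H % block == 0 and W % block == 0
    blocks = (image.reshape(H // block, block, W // block, block)
                   .swapaxes(1, 2)
                   .reshape(-1, block * block))
    return blocks.T

def group_channels(block_vectors, L=4):
    """Group consecutive block vectors into MMV problems with L channels
    (raster-order grouping is an assumption)."""
    n_blocks = block_vectors.shape[1]
    assert n_blocks % L == 0
    return [block_vectors[:, i:i + L] for i in range(0, n_blocks, L)]

image = np.arange(64 * 64, dtype=float).reshape(64, 64)  # stand-in image
vectors = image_to_blocks(image, block=8)    # 64 entries x 64 blocks
problems = group_channels(vectors, L=4)      # 16 problems of shape (64, 4)
```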

We have compared the performance of the proposed
algorithm, LSTM-CS, with SOMP, T-SBL, MT-BCS and
NWSOMP. We have not included the results of applying
BCS per channel due to its weak performance compared
to the other methods (this is shown in the experiments for
the MNIST dataset). For the different methods we used
the same settings as for the MNIST dataset, explained in
the previous section. The only differences here are: (i) for
each class of images, we used just 55 images for the
training set and 5 images for the validation set, none of
which are among the 10 images used for test; (ii) we used
15 epochs for training LSTM-CS, which is enough for this
dataset, compared to 25 epochs for the MNIST dataset.
The experiments were performed for two popular
transforms, DCT and Wavelet, for all aforementioned
reconstruction algorithms. For the wavelet transform we
used the Haar wavelet transform with 3 levels of
decomposition. Results for the DCT transform are
presented in Fig. 11. Results for the wavelet transform are
presented in Fig. 12.

To conclude the experiments section, the CPU time
of the different reconstruction algorithms for the MMV
problem discussed in this paper is presented in Fig. 13.
Each point on the curves in Fig. 13 is the time spent to
reconstruct each sparse vector, averaged over all the 8 × 8
blocks in 10 test images. We observe from this figure
that the proposed algorithm is almost as fast as the greedy
algorithms. Please note that there is a faster version of
T-SBL, known as T-MSBL. It would improve the
CPU time of T-SBL but is still slower than the other
reconstruction methods.


V. Conclusions and Future Work 

This paper presents a method to reconstruct sparse
vectors for the MMV problem. The proposed method
learns the structure of sparse vectors and does not rely on
the commonly used joint sparsity assumption. Through
experiments on two real world datasets, we showed that
the proposed method outperforms the general MMV
baseline SOMP as well as a number of Bayesian
model-based methods for the MMV problem. Please note
that we have not used multiple layers of LSTM or
advanced deep learning methods for training, e.g.,
regularization using dropout, which can improve the
performance of LSTM-CS. This paper is a proof of
concept that deep learning methods, and specifically
sequence modelling methods such as LSTM, can improve
the performance of MMV solvers significantly. This
is especially the case when the sparsity patterns are more
complicated than those obtained by the DCT or Wavelet
transforms. We showed this on the MNIST dataset.
Please note that if collecting training samples is expensive
or enough training samples are not available, using
other sparse reconstruction methods is recommended.
Our future work includes: 1) Extending LSTM-CS
to bidirectional LSTM-CS. 2) Extending the proposed
method to non-linear distributed compressive sensing.
3) Using the proposed method for video compressive
sensing, where there is correlation amongst the video
frames, and for compressive sensing of EEG signals, where
there is correlation amongst the different EEG channels.

Fig. 10. Randomly selected natural images from three different classes used for test. The first row are “buildings”, the second row are “cows”
and the third row are “flowers”.

VI. Acknowledgement 

We want to thank the authors of the compared methods
for making the code of their work available.
This was important in performing comparisons. For
reproducibility of the results, please contact the authors for
the MATLAB codes of the proposed LSTM-CS method.
We also want to thank WestGrid and Compute Canada
Calcul Canada for providing computational resources for
part of this work.


Appendix A 

Expressions for the Gradients

In this appendix we present the final gradient expressions
that are necessary for training the proposed
model for the MMV problem. Due to lack of space,
we omit the full derivations of these
gradients.


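Starting with the cost function used for training, we
use the Nesterov method described earlier to
update the LSTM-CS model parameters. Here, $\Lambda$
is one of the weight matrices or bias vectors
$\{\mathbf{W}_1, \mathbf{W}_2, \mathbf{W}_3, \mathbf{W}_4, \mathbf{W}_{rec1}, \mathbf{W}_{rec2}, \mathbf{W}_{rec3}, \mathbf{W}_{rec4}, \mathbf{W}_{p1}, \mathbf{W}_{p2}, \mathbf{W}_{p3}, \mathbf{b}_1, \mathbf{b}_2, \mathbf{b}_3, \mathbf{b}_4\}$ in the LSTM-CS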
architecture. The general format of the gradient of the 
cost function, VL(A), is the same as d- To calculate 
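$\partial L(\Lambda)/\partial \Lambda$ from the cost function, we have:

$$\frac{\partial L(\Lambda)}{\partial \Lambda} = \sum_{i} \sum_{r} \sum_{\tau=1}^{L} \frac{\partial L_{r,i,\tau}(\Lambda)}{\partial \Lambda} \tag{19}$$

where the sums run over the indices $r$ and $i$ of the training samples and over the $L$ channels $\tau$.
After a straightforward derivation of derivatives we will
have:

$$\frac{\partial L_{r,i,\tau}(\Lambda)}{\partial \Lambda} = -\left(\mathbf{s}_{0,r,i,\tau} - \beta\,\mathbf{s}_{r,i,\tau}\right)^T \frac{\partial \mathbf{z}_{\tau}}{\partial \Lambda} \tag{20}$$

where $\mathbf{z}_{\tau}$ is the vector $\mathbf{z}$ for the $\tau$-th channel in Fig. 1 and
$\beta$ is a scalar defined as:

$$\beta = \sum_{j=1}^{N} s_{0,r,i,\tau}(j) \tag{21}$$

Since during training data generation we have generated
one-hot vectors for $\mathbf{s}_0$, $\beta$ always equals 1. Since we are
looking at the different channels as a sequence, for a
clearer presentation we show any vector corresponding to
the $t$-th channel with $(t)$ instead of the index $\tau$. For example,
$\mathbf{z}_{\tau}$ is represented by $\mathbf{z}(t)$.

Since $\mathbf{z}(t) = \mathbf{U}\mathbf{v}(t)$ we have:

$$\frac{\partial \mathbf{z}(t)}{\partial \Lambda} = \mathbf{U} \frac{\partial \mathbf{v}(t)}{\partial \Lambda} \tag{22}$$

Combining (20), (21) and (22) we will have:

$$\frac{\partial L_{r,i}(t)}{\partial \Lambda} = \left[\mathbf{U}^T\left(\mathbf{s}_{r,i}(t) - \mathbf{s}_{0,r,i}(t)\right)\right]^T \frac{\partial \mathbf{v}(t)}{\partial \Lambda} \tag{23}$$

Starting from the “$t = L$”-th channel, we define $\mathbf{e}(t)$ as:

$$\mathbf{e}(t) = \mathbf{U}^T\left(\mathbf{s}_{r,i}(t) - \mathbf{s}_{0,r,i}(t)\right) \tag{24}$$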
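The error signal in (24) is the gradient of the cross-entropy cost with respect to $\mathbf{v}(t)$ when the target is a one-hot vector ($\beta = 1$). The sketch below checks this numerically on random data. The softmax output layer and the per-channel cost $-\sum_j s_0(j) \log s(j)$ are taken from the description of the model; the function names and the toy dimensions are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)            # shift for numerical stability
    expz = np.exp(z)
    return expz / expz.sum()

def loss(U, v, s0):
    """Cross-entropy between the one-hot target s0 and softmax(U v)."""
    return -np.sum(s0 * np.log(softmax(U @ v)))

def error_signal(U, v, s0):
    """e = U^T (s - s0), as in (24) with beta = 1."""
    return U.T @ (softmax(U @ v) - s0)

rng = np.random.default_rng(0)
n_out, n_cell = 6, 4             # toy sizes, not the ones used in the paper
U = rng.standard_normal((n_out, n_cell))
v = rng.standard_normal(n_cell)
s0 = np.zeros(n_out)
s0[2] = 1.0                      # one-hot target

e = error_signal(U, v, s0)

# Central finite differences of the loss with respect to v.
eps = 1e-6
basis = np.eye(n_cell)
numeric = np.array([
    (loss(U, v + eps * basis[k], s0) - loss(U, v - eps * basis[k], s0)) / (2 * eps)
    for k in range(n_cell)
])
```

The analytic vector `e` and the finite-difference estimate `numeric` should agree to within the accuracy of the difference scheme.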
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Fig. 11. Comparison of different MMV reconstruction algorithms 
for natural image dataset using DCT transform and just one layer 
for LSTM model in LSTM-CS. Image classes from top to bottom 
respectively: buildings, cows and flowers. 


Fig. 12. Comparison of different MMV reconstruction algorithms 
for natural image dataset using Wavelet transform and just one layer 
for LSTM model in LSTM-CS. Image classes from top to bottom 
respectively: buildings, cows and flowers. 


The expressions for the gradients for the different parameters
of the LSTM-CS model are presented in the subsequent
sections. We omit the subscripts r and i for simplicity
of presentation. Please note that the final value of the
gradient is the sum of the gradient values over the mini-batch
samples and the channels, as represented by the
summations in the cost function.


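A. Output Weights U

$$\frac{\partial L_t}{\partial \mathbf{U}} = \left(\mathbf{s}(t) - \mathbf{s}_0(t)\right)\mathbf{v}(t)^T \tag{25}$$

B. Output Gate

For recurrent connections we have:

$$\frac{\partial L_t}{\partial \mathbf{W}_{rec1}} = \boldsymbol{\delta}^{rec1}(t)\,\mathbf{v}(t-1)^T \tag{26}$$

where

$$\boldsymbol{\delta}^{rec1}(t) = \mathbf{o}(t) \circ (1 - \mathbf{o}(t)) \circ h(\mathbf{c}(t)) \circ \mathbf{e}(t) \tag{27}$$

For input connections, $\mathbf{W}_1$, and peephole connections,
$\mathbf{W}_{p1}$, we will have:

$$\frac{\partial L_t}{\partial \mathbf{W}_1} = \boldsymbol{\delta}^{rec1}(t)\,\mathbf{r}(t)^T \tag{28}$$

$$\frac{\partial L_t}{\partial \mathbf{W}_{p1}} = \boldsymbol{\delta}^{rec1}(t)\,\mathbf{c}(t)^T \tag{29}$$

The derivative for output gate bias values will be:

$$\frac{\partial L_t}{\partial \mathbf{b}_1} = \boldsymbol{\delta}^{rec1}(t) \tag{30}$$

C. Input Gate

For the recurrent connections we have:

$$\frac{\partial L_t}{\partial \mathbf{W}_{rec3}} = \mathrm{diag}\left(\boldsymbol{\delta}^{rec3}(t)\right) \frac{\partial \mathbf{c}(t)}{\partial \mathbf{W}_{rec3}} \tag{31}$$

where

$$\begin{aligned}
\boldsymbol{\delta}^{rec3}(t) &= (1 - h(\mathbf{c}(t))) \circ (1 + h(\mathbf{c}(t))) \circ \mathbf{o}(t) \circ \mathbf{e}(t) \\
\frac{\partial \mathbf{c}(t)}{\partial \mathbf{W}_{rec3}} &= \mathrm{diag}(\mathbf{f}(t)) \frac{\partial \mathbf{c}(t-1)}{\partial \mathbf{W}_{rec3}} + \mathbf{b}_i(t)\,\mathbf{v}(t-1)^T \\
\mathbf{b}_i(t) &= \mathbf{y}_g(t) \circ \mathbf{i}(t) \circ (1 - \mathbf{i}(t))
\end{aligned} \tag{32}$$

For the input connections we will have the following:

$$\frac{\partial L_t}{\partial \mathbf{W}_3} = \mathrm{diag}\left(\boldsymbol{\delta}^{rec3}(t)\right) \frac{\partial \mathbf{c}(t)}{\partial \mathbf{W}_3} \tag{33}$$

where

$$\frac{\partial \mathbf{c}(t)}{\partial \mathbf{W}_3} = \mathrm{diag}(\mathbf{f}(t)) \frac{\partial \mathbf{c}(t-1)}{\partial \mathbf{W}_3} + \mathbf{b}_i(t)\,\mathbf{r}(t)^T \tag{34}$$

For the peephole connections we will have:

$$\frac{\partial L_t}{\partial \mathbf{W}_{p3}} = \mathrm{diag}\left(\boldsymbol{\delta}^{rec3}(t)\right) \frac{\partial \mathbf{c}(t)}{\partial \mathbf{W}_{p3}} \tag{35}$$

where

$$\frac{\partial \mathbf{c}(t)}{\partial \mathbf{W}_{p3}} = \mathrm{diag}(\mathbf{f}(t)) \frac{\partial \mathbf{c}(t-1)}{\partial \mathbf{W}_{p3}} + \mathbf{b}_i(t)\,\mathbf{c}(t-1)^T \tag{36}$$

For bias values, $\mathbf{b}_3$, we will have:

$$\frac{\partial L_t}{\partial \mathbf{b}_3} = \mathrm{diag}\left(\boldsymbol{\delta}^{rec3}(t)\right) \frac{\partial \mathbf{c}(t)}{\partial \mathbf{b}_3} \tag{37}$$

where

$$\frac{\partial \mathbf{c}(t)}{\partial \mathbf{b}_3} = \mathrm{diag}(\mathbf{f}(t)) \frac{\partial \mathbf{c}(t-1)}{\partial \mathbf{b}_3} + \mathbf{b}_i(t) \tag{38}$$

D. Forget Gate

For the recurrent connections we will have:

$$\frac{\partial L_t}{\partial \mathbf{W}_{rec2}} = \mathrm{diag}\left(\boldsymbol{\delta}^{rec2}(t)\right) \frac{\partial \mathbf{c}(t)}{\partial \mathbf{W}_{rec2}} \tag{39}$$

where

$$\begin{aligned}
\boldsymbol{\delta}^{rec2}(t) &= (1 - h(\mathbf{c}(t))) \circ (1 + h(\mathbf{c}(t))) \circ \mathbf{o}(t) \circ \mathbf{e}(t) \\
\frac{\partial \mathbf{c}(t)}{\partial \mathbf{W}_{rec2}} &= \mathrm{diag}(\mathbf{f}(t)) \frac{\partial \mathbf{c}(t-1)}{\partial \mathbf{W}_{rec2}} + \mathbf{b}_f(t)\,\mathbf{v}(t-1)^T \\
\mathbf{b}_f(t) &= \mathbf{c}(t-1) \circ \mathbf{f}(t) \circ (1 - \mathbf{f}(t))
\end{aligned} \tag{40}$$

For input connections to the forget gate we will have:

$$\frac{\partial L_t}{\partial \mathbf{W}_2} = \mathrm{diag}\left(\boldsymbol{\delta}^{rec2}(t)\right) \frac{\partial \mathbf{c}(t)}{\partial \mathbf{W}_2} \tag{41}$$

where

$$\frac{\partial \mathbf{c}(t)}{\partial \mathbf{W}_2} = \mathrm{diag}(\mathbf{f}(t)) \frac{\partial \mathbf{c}(t-1)}{\partial \mathbf{W}_2} + \mathbf{b}_f(t)\,\mathbf{r}(t)^T \tag{42}$$

For peephole connections we have:

$$\frac{\partial L_t}{\partial \mathbf{W}_{p2}} = \mathrm{diag}\left(\boldsymbol{\delta}^{rec2}(t)\right) \frac{\partial \mathbf{c}(t)}{\partial \mathbf{W}_{p2}} \tag{43}$$

where

$$\frac{\partial \mathbf{c}(t)}{\partial \mathbf{W}_{p2}} = \mathrm{diag}(\mathbf{f}(t)) \frac{\partial \mathbf{c}(t-1)}{\partial \mathbf{W}_{p2}} + \mathbf{b}_f(t)\,\mathbf{c}(t-1)^T \tag{44}$$

For the forget gate’s bias values we will have:

$$\frac{\partial L_t}{\partial \mathbf{b}_2} = \mathrm{diag}\left(\boldsymbol{\delta}^{rec2}(t)\right) \frac{\partial \mathbf{c}(t)}{\partial \mathbf{b}_2} \tag{45}$$

where

$$\frac{\partial \mathbf{c}(t)}{\partial \mathbf{b}_2} = \mathrm{diag}(\mathbf{f}(t)) \frac{\partial \mathbf{c}(t-1)}{\partial \mathbf{b}_2} + \mathbf{b}_f(t) \tag{46}$$

Fig. 13. CPU time for different MMV reconstruction algorithms. These
times are for the experiment using DCT transform for 10 test images
from the building class. The bottom figure is the same as top figure
but without T-SBL and LS-CS to make the difference among different
methods more clear.
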

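E. Input without Gating ($\mathbf{y}_g(t)$)

For recurrent connections we will have:

$$\frac{\partial L_t}{\partial \mathbf{W}_{rec4}} = \mathrm{diag}\left(\boldsymbol{\delta}^{rec4}(t)\right) \frac{\partial \mathbf{c}(t)}{\partial \mathbf{W}_{rec4}} \tag{47}$$

where

$$\begin{aligned}
\boldsymbol{\delta}^{rec4}(t) &= (1 - h(\mathbf{c}(t))) \circ (1 + h(\mathbf{c}(t))) \circ \mathbf{o}(t) \circ \mathbf{e}(t) \\
\frac{\partial \mathbf{c}(t)}{\partial \mathbf{W}_{rec4}} &= \mathrm{diag}(\mathbf{f}(t)) \frac{\partial \mathbf{c}(t-1)}{\partial \mathbf{W}_{rec4}} + \mathbf{b}_g(t)\,\mathbf{v}(t-1)^T \\
\mathbf{b}_g(t) &= \mathbf{i}(t) \circ (1 - \mathbf{y}_g(t)) \circ (1 + \mathbf{y}_g(t))
\end{aligned} \tag{48}$$

For input connections we have:

$$\frac{\partial L_t}{\partial \mathbf{W}_4} = \mathrm{diag}\left(\boldsymbol{\delta}^{rec4}(t)\right) \frac{\partial \mathbf{c}(t)}{\partial \mathbf{W}_4} \tag{49}$$

where

$$\frac{\partial \mathbf{c}(t)}{\partial \mathbf{W}_4} = \mathrm{diag}(\mathbf{f}(t)) \frac{\partial \mathbf{c}(t-1)}{\partial \mathbf{W}_4} + \mathbf{b}_g(t)\,\mathbf{r}(t)^T \tag{50}$$

For bias values we will have:

$$\frac{\partial L_t}{\partial \mathbf{b}_4} = \mathrm{diag}\left(\boldsymbol{\delta}^{rec4}(t)\right) \frac{\partial \mathbf{c}(t)}{\partial \mathbf{b}_4} \tag{51}$$

where

$$\frac{\partial \mathbf{c}(t)}{\partial \mathbf{b}_4} = \mathrm{diag}(\mathbf{f}(t)) \frac{\partial \mathbf{c}(t-1)}{\partial \mathbf{b}_4} + \mathbf{b}_g(t) \tag{52}$$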


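F. Error signal backpropagation

Error signals are backpropagated through time using the
following equations:

$$\boldsymbol{\delta}^{rec1}(t-1) = \left[\mathbf{o}(t-1) \circ (1 - \mathbf{o}(t-1)) \circ h(\mathbf{c}(t-1))\right] \circ \left[\mathbf{W}_{rec1}^T\,\boldsymbol{\delta}^{rec1}(t) + \mathbf{e}(t-1)\right] \tag{53}$$

$$\boldsymbol{\delta}^{rec_i}(t-1) = \left[(1 - h(\mathbf{c}(t-1))) \circ (1 + h(\mathbf{c}(t-1))) \circ \mathbf{o}(t-1)\right] \circ \left[\mathbf{W}_{rec_i}^T\,\boldsymbol{\delta}^{rec_i}(t) + \mathbf{e}(t-1)\right], \quad \text{for } i \in \{2, 3, 4\} \tag{54}$$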
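One step of the recursions (53) and (54) can be sketched in code as follows. The sketch assumes that $h(\cdot)$ is the hyperbolic tangent, which is consistent with the derivative factor $(1 - h)(1 + h)$ in (54) but is not stated in this appendix. The function name `backprop_deltas` and the dictionary-based interface are hypothetical.

```python
import numpy as np

def backprop_deltas(W_rec, delta_next, o_prev, c_prev, e_prev):
    """Compute delta^{rec_i}(t-1), i = 1..4, from delta^{rec_i}(t).

    W_rec and delta_next are dicts keyed by 1..4 holding the recurrent
    weight matrices and the error signals at time t. o_prev, c_prev and
    e_prev are the output gate, cell state and e(.) at time t-1.
    """
    h = np.tanh(c_prev)          # assumption: h(.) is tanh
    deltas = {}
    # Output gate, eq. (53).
    deltas[1] = (o_prev * (1 - o_prev) * h) * (W_rec[1].T @ delta_next[1] + e_prev)
    # Forget gate, input gate and ungated input, eq. (54).
    for i in (2, 3, 4):
        deltas[i] = ((1 - h) * (1 + h) * o_prev) * (W_rec[i].T @ delta_next[i] + e_prev)
    return deltas

# Toy check: with zero error signals at time t, only e(t-1) contributes.
rng = np.random.default_rng(0)
n = 3
W_rec = {i: rng.standard_normal((n, n)) for i in (1, 2, 3, 4)}
zero_next = {i: np.zeros(n) for i in (1, 2, 3, 4)}
o_prev = np.full(n, 0.5)
c_prev = np.full(n, 0.3)
e_prev = np.ones(n)
deltas = backprop_deltas(W_rec, zero_next, o_prev, c_prev, e_prev)
```

In a full training loop this step would be applied for $t = L, L-1, \ldots, 2$, starting from the error signals of the last channel.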
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